AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.
The goals are to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segments of customers to target.
ID: Customer ID
Age: Customer's age in completed years
Experience: Years of professional experience
Income: Annual income of the customer (in thousand dollars)
ZIPCode: Home address ZIP code
Family: Family size of the customer
CCAvg: Average spending on credit cards per month (in thousand dollars)
Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
Mortgage: Value of house mortgage, if any (in thousand dollars)
Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
CreditCard: Does the customer use a credit card issued by any bank other than AllLife Bank? (0: No, 1: Yes)
# Installing the libraries with the specified versions
# pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# to split the data into train and test sets
from sklearn.model_selection import train_test_split
# to build a linear regression model
from sklearn.linear_model import LinearRegression
# to check a regression model's performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# to build decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# to tune different models
from sklearn.model_selection import GridSearchCV
# to compute classification metrics
from sklearn.metrics import (
confusion_matrix,
accuracy_score,
recall_score,
precision_score,
f1_score,
)
# Libraries for scaling numerical features
from sklearn.preprocessing import StandardScaler
# to perform k-means clustering
from sklearn.cluster import KMeans
# to perform silhouette analysis
from sklearn.metrics import silhouette_score
# to perform t-SNE
from sklearn.manifold import TSNE
# to define a common seed value to be used throughout
RS = 42
# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code ensures that all necessary libraries and their dependencies are available to successfully execute the code in this notebook.
# Load the dataset from my google drive
from google.colab import drive
drive.mount('/content/drive')
# load data into a pandas dataframe
loan_modelling_orig = pd.read_csv("/content/drive/MyDrive/AIML/Project2 Personal Loan/Loan_Modelling.csv")
# creating a copy of the data
data = loan_modelling_orig.copy()
data.head()
data.tail()
data.shape
data.info()
data.describe(include="all").T
The mean and median age are both around 45 years, which suggests a roughly symmetric age distribution.
The minimum value for Experience is -3, which is impossible. This warrants further analysis and a count of values less than zero; most likely these should be flipped to positive values or set to zero.
Median income is \$64k, while mean income is just under \$74k, so the distribution is right-skewed. This makes sense given a logical floor of 0 and potentially unlimited income levels.
Fewer than 50% of customers have a mortgage.
Fewer than 50% of customers have a credit card from another bank.
Fewer than 25% have a personal loan.
Fewer than 25% have a securities account.
At least 50% use online banking.
All customers have at least an undergraduate education.
At least 25% hold an advanced or professional degree.
At least 25% have a family size of 1 (likely single).
At least 25% have a family size of 3 or more (likely married with children).
Mean monthly credit card spending is \$1,937.94; however, the standard deviation is \$1,747.66, about 90% of the mean, indicating wide dispersion in spending.
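The mean-above-median reading for Income can be checked numerically with pandas' built-in skewness statistic. A minimal sketch on synthetic stand-in values (not the real Income column):

```python
import pandas as pd

# Hypothetical stand-in data illustrating a right-skewed income distribution;
# in the notebook this check would run on data['Income'].
income = pd.Series([30, 40, 45, 55, 60, 64, 70, 80, 95, 180, 220])

mean_income = income.mean()
median_income = income.median()
skewness = income.skew()  # positive => right-skewed

print(f"mean={mean_income:.1f}, median={median_income:.1f}, skew={skewness:.2f}")
```

A mean above the median together with a positive `skew()` value both point to a right tail, matching the observation above.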
# Filter for negative 'Experience' values
negative_experience_values = data['Experience'][data['Experience'] < 0]
# Get the total number of occurrences of each negative value
negative_experience_counts = negative_experience_values.value_counts()
# Display the result
print(negative_experience_counts)
NOTE: This was only done on the 'data' copy, not the original .csv
# Convert negative values in 'Experience' to positive
data['Experience'] = data['Experience'].abs()
# Verify the changes
print(data['Experience'])
data.describe(include="all").T
data.isnull().sum()
data.duplicated().sum()
Group variables by type of data into tuples for more efficient analysis
# Separate columns into continuous/numerical variables and categorical/ordinal/dichotomous
# Continuous data analysis
continuous_vars = ('Income', 'Age', 'Experience', 'CCAvg', 'Mortgage')
# Categorical data analysis
categorical_vars = ('ZIPCode',)  # trailing comma makes this a one-element tuple, not a string
# Binary categorical data analysis
binary_vars = ('Personal_Loan', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard')
Visualize and analyze the distribution of continuous variables
# defining the figure size
plt.figure(figsize=(12, 10))
# Iterate over the continuous variables
for i, feature in enumerate(continuous_vars):
# Histogram
plt.subplot(len(continuous_vars), 2, 2*i+1) # assign a subplot in the main plot
sns.histplot(data=data, x=feature) # plot the histogram
plt.title(f'Histogram of {feature}') # add a title
# Box Plot
plt.subplot(len(continuous_vars), 2, 2*i+2) # assign subplot for box plot
sns.boxplot(data=data, x=feature) # plot the boxplot
plt.title(f'Box Plot of {feature}') # add a title
plt.tight_layout(); # to add spacing between plots
plt.show()
Investigate Income, CCAvg, and Mortgage Outliers
# 'Income'
print('-INCOME-')
# perform calculations on 'Income' outliers
upper_whisker_inc = data['Income'].quantile(0.75) + (1.5*(data['Income'].quantile(0.75) - data['Income'].quantile(0.25)))
print('Income upper_whisker ==', upper_whisker_inc)
# create dataframes for high and non-high income customers
high_income_customers = data['Income'][data['Income'] > upper_whisker_inc]
not_high_income_customers = data['Income'][data['Income'] <= upper_whisker_inc]
# count the number of customers income above upper_whisker
print('number of high income customers (above upper_whisker) ==',len(high_income_customers))
# calculate income data
print('total income of high income customers ==', high_income_customers.sum())
print('total income of non-high income customers ==', not_high_income_customers.sum())
print('percent of total income from high income customers', 100*(high_income_customers.sum()/not_high_income_customers.sum()))
print('\n')
# 'CCAvg'
print('-CCAvg-')
# perform calculations on 'CCAvg' outliers
upper_whisker_cc = data['CCAvg'].quantile(0.75) + (1.5*(data['CCAvg'].quantile(0.75) - data['CCAvg'].quantile(0.25)))
print('CCAvg upper_whisker ==', upper_whisker_cc)
# create dataframes for high and non-high CCAvg customers
high_cc_customers = data['CCAvg'][data['CCAvg'] > upper_whisker_cc]
not_high_cc_customers = data['CCAvg'][data['CCAvg'] <= upper_whisker_cc]
zero_cc_customers = data['CCAvg'][data['CCAvg'] == 0]
# count the number of customers CCAvg above upper_whisker
print('number of high cc customers (above upper_whisker) ==',len(high_cc_customers))
# calculate CCAvg data
print('number of customers with CC ==',(data.shape[0] - len(zero_cc_customers)))
print('number of customers with no CC ==', len(zero_cc_customers))
print('percent of customers with zero CC ==', 100*(len(zero_cc_customers)/data.shape[0]))
print('total credit spending of high CCAvg customers ==', high_cc_customers.sum())
print('total credit spending of non-high CCAvg customers ==', not_high_cc_customers.sum())
print('percent of total credit spending from high CCAvg customers ==', 100*(high_cc_customers.sum()/not_high_cc_customers.sum()))
print('\n')
# 'Mortgage'
print('-MORTGAGE-')
# perform calculations on 'Mortgage' outliers
upper_whisker_mort = data['Mortgage'].quantile(0.75) + (1.5*(data['Mortgage'].quantile(0.75) - data['Mortgage'].quantile(0.25)))
print('Mortgage upper_whisker ==', upper_whisker_mort)
# create dataframes for high, non-high, and zero Mortgage customers
high_mort_customers = data['Mortgage'][data['Mortgage'] > upper_whisker_mort]
not_high_mort_customers = data['Mortgage'][data['Mortgage'] <= upper_whisker_mort]
zero_mort_customers = data['Mortgage'][data['Mortgage'] == 0]
# count the number of customers Mortgage above upper_whisker
print('number of high Mortgage customers (above upper_whisker) ==',len(high_mort_customers))
# count the number of customers Mortgage equal to zero
print('number of customers without a mortgage ==',len(zero_mort_customers))
# calculate Mortgage data
print('total Mortgage balances of high Mortgage customers ==', high_mort_customers.sum())
print('total Mortgage balances of non-high Mortgage customers ==', not_high_mort_customers.sum())
print('percent of total Mortgage balances from high Mortgage customers ==', 100*(high_mort_customers.sum()/not_high_mort_customers.sum()))
print('percent of customers with zero Mortgage ==', 100*(len(zero_mort_customers)/data.shape[0]))
Income is right-skewed, with most customers earning between \$30k–\$100k; 96 customers have income above the upper whisker value of \$186.5k. They account for less than 2% of customers but 5.3% of all customer income. Scaling will be required to reduce the influence of high-income customers on clustering.
Age is fairly uniformly distributed from ~25 to ~65. No obvious skew.
Experience is fairly uniform with small dips and is consistent with Age. Might be correlated with Age. Need to watch for multicollinearity.
CCAvg is strongly right-skewed. 4894 or 97.8% of people have a CC. Most people spend under \$2k/month. A few spend up to \$10k but this presents a long right whisker, and nearly 6.5% of the total number of customers are outliers accounting for nearly 23% of all CC spending. This skew could heavily influence clustering or tree splits and will need to be addressed with scaling.
There is a large group of non-mortgage holders 69.2% (possibly renters or fully paid homeowners) and a small segment of 291 high-value mortgage customers (considered outliers over \$252.5k) that are responsible for nearly 58% mortgage exposure. This group may warrant special attention or segmentation for marketing, risk, or retention strategies.
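The three near-identical whisker blocks above can be collapsed into one reusable helper. This is a sketch, not part of the original notebook; `summarize_outliers` is a hypothetical name, and the demo Series stands in for a real column such as `data['Income']`:

```python
import pandas as pd

def summarize_outliers(s: pd.Series) -> dict:
    """Summarize values above the Tukey upper whisker (Q3 + 1.5*IQR)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    upper = q3 + 1.5 * (q3 - q1)
    high = s[s > upper]
    return {
        "upper_whisker": upper,
        "n_outliers": len(high),
        "pct_outliers": 100 * len(high) / len(s),
        "pct_of_total": 100 * high.sum() / s.sum() if s.sum() else 0.0,
    }

# Usage on synthetic stand-in data:
demo = pd.Series([30, 40, 50, 60, 70, 80, 90, 100, 250, 300])
res = summarize_outliers(demo)
print(res)
```

Calling this once per column ('Income', 'CCAvg', 'Mortgage') would reproduce the per-column printouts with far less repetition.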
Analyze ZIPcode
ZIPCode_counts = data['ZIPCode'].value_counts()
ZIPCode_summary = pd.DataFrame({
'ZIPCode': ZIPCode_counts.index,
'total_customers': ZIPCode_counts.values,
'percent_of_customers': (ZIPCode_counts.values / data.shape[0]*100).round(2)
})
# Sort the DataFrame by 'total_customers' in descending order
ZIPCode_summary = ZIPCode_summary.sort_values(by='total_customers', ascending=False)
print(ZIPCode_summary.head(20))
# Plotting
plt.figure(figsize=(10, 6))
plt.title('Barplot: Number of Customers per ZIPCode')
plt.ylim(0, 200)
plt.xlabel('ZIPCode')
plt.ylabel('Number of Customers')
plt.xticks(ticks=range(len(ZIPCode_summary)), labels=[''] * len(ZIPCode_summary))
sns.barplot(x='ZIPCode', y='total_customers', data=ZIPCode_summary, order=ZIPCode_summary['ZIPCode'])
plt.show()
Analyze Binary Variables
# Create an empty list to store the summary for each binary variable
bin_summary_list = []
# Loop over each binary variable to calculate sums and percentages
for var in binary_vars:
counts = data[var].value_counts() # Get the counts of 0s and 1s
total = len(data[var]) # Total number of entries in the column
bin_summary_list.append({
'Variable': var,
'Sum of 0s': counts.get(0, 0), # Defaulting to 0 if not present
'Sum of 1s': counts.get(1, 0), # Defaulting to 0 if not present
'Percentage of 0s': (counts.get(0, 0) / total * 100),
'Percentage of 1s': (counts.get(1, 0) / total * 100)
})
# Convert the list of dictionaries into a DataFrame
bin_summary_df = pd.DataFrame(bin_summary_list)
# Display the summary DataFrame
bin_summary_df.head()
Heatmap
# defining the size of the plot
plt.figure(figsize=(12, 7))
# plotting the heatmap for correlation
sns.heatmap(
data.corr(),annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
Notable Personal Loan Correlations:
Other Strong positive correlations:
Other Negative correlations:
Notable Low/No-Correlations:
Based on this, it would appear that customers with higher income, higher credit card spending, existing CD accounts, advanced education, and mortgages are more likely to accept personal loans. We should continue bivariate analysis across those variables.
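Rather than reading values off the heatmap, the correlations with the target can be ranked directly. A small sketch on a synthetic frame; in the notebook the same expression would run on `data.corr()['Personal_Loan']`:

```python
import pandas as pd

# Synthetic stand-in frame; the real notebook would use `data`.
df = pd.DataFrame({
    "Income":        [40, 60, 90, 120, 150, 200],
    "CCAvg":         [0.5, 1.0, 1.5, 3.0, 4.5, 6.0],
    "Age":           [25, 60, 35, 50, 30, 45],
    "Personal_Loan": [0, 0, 0, 1, 1, 1],
})

corr_with_target = (
    df.corr()["Personal_Loan"]
      .drop("Personal_Loan")          # exclude the self-correlation of 1.0
      .sort_values(ascending=False)   # strongest positive correlations first
)
print(corr_with_target)
```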
Pairplot
# Select the columns for the pairplot
columns = ['Age', 'Personal_Loan', 'Income', 'CD_Account', 'Education', 'CCAvg', 'Mortgage']
pp_data = data[columns]
sns.pairplot(pp_data, hue='Personal_Loan', diag_kind='hist')
# Display the plot
plt.show()
Income and CCAvg: There is a noticeable concentration of customers with Personal Loans with higher Income (above ~\$100k) and higher CCAvg (above ~\$4k/month), indicating that customers with higher income and credit card spending are more likely to accept loans.
Education: The scatter plots show a slight tendency for customers with personal loans to have higher Education levels (2 or 3), aligning with the earlier correlation (0.14).
CD_Account and Securities_Account: Customers with CD accounts or Securities accounts tend to have a higher likelihood of accepting personal loans, especially when combined with higher Income or CCAvg.
Mortgage: There is a spread of personal loan customers across various Mortgage values, with a slight lean toward higher values, consistent with the 0.14 correlation.
Age itself does not show a strong direct correlation with most variables or Personal_Loan. The strongest pattern is a slight increase in loan acceptance among older customers (40–60) with higher Income, CCAvg, Mortgage, CD_Account, and Securities_Account; however, I attribute that more to the other variables than to age alone.
Overall, targeting customers with high Income, high CCAvg, existing CD/Securities accounts, and higher Education levels could improve loan conversion rates.
Examine Personal_Loan against several highly correlated non-binary variables
#Personal_Loan and Income (0.50)
#Personal_Loan and CCAvg (0.37)
#Personal_Loan and CD_Account (0.32) -- binary, not used here
#Personal_Loan and Education (0.14)
#Personal_Loan and Mortgage (0.14)
high_corr_vars = ['Income', 'CCAvg', 'Education', 'Mortgage']
# Set the size of the entire figure
plt.figure(figsize=(15, 10))
# Loop over each variable and create a box plot
for i, hcv in enumerate(high_corr_vars, 1): # Start indexing from 1 for plt.subplot
plt.subplot(2,2, i) # Create a subplot in a 2x2 grid
sns.boxplot(data=data, x='Personal_Loan', y=hcv)
plt.title(f'Boxplot - {hcv}')
# If there are any remaining subplot positions, leave them empty
# Adjust layout
plt.tight_layout()
plt.show()
Income: Customers with a personal loan have a higher median income (~\$150k) compared to customers who do not (~\$100k). Both groups have outliers above \$200k, but the spread is wider for loan acceptors, suggesting higher income correlates with loan uptake.
CCAvg: Median credit card spending for customers with a personal loan is roughly double that of customers without one: about \$4k versus \$2k per month. Significant outliers (up to \$10k) exist in both groups, with more extreme values among loan acceptors, indicating higher spending may drive interest in personal loans.
Education: Medians are similar (~2.0–2.5) for both groups, slightly higher for customers with personal loans. There are no outliers; education level has a mild positive association with loan acceptance.
Mortgage: Customers without a personal loan have a lower median mortgage (~\$100k) with many outliers up to \$600k. Customers with a personal loan have a higher median mortgage (~\$200k) and fewer extreme outliers, suggesting customers with moderate to high mortgages are more likely to take loans.
Overall: Customers with higher Income, CCAvg, CD_Account presence, slightly higher Education, and moderate to high Mortgages are more likely to accept personal loans, supporting targeted marketing toward these segments.
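The boxplot comparison above can also be summarized numerically with a groupby. A sketch on a synthetic stand-in frame; in the notebook this would operate on the real columns of `data`:

```python
import pandas as pd

# Hypothetical stand-in data; the notebook version would be
# data.groupby('Personal_Loan')[['Income', 'CCAvg', 'Mortgage']].median()
df = pd.DataFrame({
    "Personal_Loan": [0, 0, 0, 0, 1, 1, 1, 1],
    "Income":        [40, 60, 80, 100, 120, 150, 170, 190],
    "CCAvg":         [1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 5.0],
})

# Median of each numeric feature per target class
medians = df.groupby("Personal_Loan")[["Income", "CCAvg"]].median()
print(medians)
```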
# Drop the ID, Experience, and ZIPCode columns
slimData = data.drop(['ID', 'Experience', 'ZIPCode'], axis=1)
# Create slimCapData by copying slimData
slimCapData = slimData.copy()
# Apply capping to 'Income', 'CCAvg', and 'Mortgage' at the 95th percentile
for col in ['Income', 'CCAvg', 'Mortgage']:
upper_limit = slimCapData[col].quantile(0.95)
slimCapData[col] = slimCapData[col].clip(upper=upper_limit)
# Verify the change (optional)
print(slimCapData[['Income', 'CCAvg', 'Mortgage']].describe())
print(data[['Income', 'CCAvg', 'Mortgage']].describe())
slimCapData heatmap
# defining the size of the plot
plt.figure(figsize=(12, 7))
# produce heatmap with new slimCapData
sns.heatmap(
slimCapData.corr(),annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
# Original data correlations
# Personal_Loan and Income (0.50)
# Personal_Loan and CCAvg (0.37)
# Personal_Loan and CD_Account (0.32) -- binary, not used here
# Personal_Loan and Education (0.14)
# Personal_Loan and Mortgage (0.14)
# Initialize the StandardScaler object
scaler = StandardScaler()
# Fit the scaler to the numerical columns, transform them (i.e., execute the scaling), and create a new dataframe with scaled data
scaled_slim_data = pd.DataFrame(scaler.fit_transform(slimData), columns=slimData.columns)  # keep column names on the scaled frame
# Display the scaled data
scaled_slim_data.head()
scaled_slim_data.describe()
# Initialize the StandardScaler object
scaler = StandardScaler()
# Fit the scaler to the numerical columns, transform them (i.e., execute the scaling), and create a new dataframe with scaled data
scaled_cap_data = pd.DataFrame(scaler.fit_transform(slimCapData), columns=slimCapData.columns)  # keep column names on the scaled frame
# Display the scaled data
scaled_cap_data.head()
scaled_cap_data.describe()
# define the explanatory (independent) and response (dependent) variables
# from the slimData and slimCapData dataframes, create new dataframes '...X' but drop 'Personal_Loan'
slimX = slimData.drop('Personal_Loan', axis=1)
capX = slimCapData.drop('Personal_Loan', axis=1)
# create new dataframes '...Y' that contain only the 'Personal_Loan' column
slimY = slimData['Personal_Loan']
capY = slimCapData['Personal_Loan']
# splitting the data in 80:20 ratio for train and test sets
# ...X_train, ...X_test, ...y_train, ...y_test = train_test_split(X, y, test_size=0.20, random_state=RS)
# random_state, RS - using the same seed value (42) to ensure running the code again will produce the same split
slimX_train, slimX_test, slimY_train, slimY_test = train_test_split(
slimX, # specifying the independent variables 'The Feature Matrix' that contains independent variables
slimY, # specifying the dependent variable 'The Target Vector' that contains dependent variables
test_size=0.20, # specifying the size of the test set as a fraction of the whole data
random_state=RS # specifying a seed value to enable reproducible results
)
capX_train, capX_test, capY_train, capY_test = train_test_split(
capX, # specifying the independent variables 'The Feature Matrix' that contains independent variables
capY, # specifying the dependent variable 'The Target Vector' that contains dependent variables
test_size=0.20, # specifying the size of the test set as a fraction of the whole data
random_state=RS # specifying a seed value to enable reproducible results
)
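One caveat with the split above: with roughly 9% positives, an unstratified split can shift the class ratio between train and test. A minimal sketch (on synthetic stand-in data, not slimX/slimY) showing how `train_test_split`'s `stratify` parameter preserves the positive rate:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with ~9% positives, mimicking Personal_Loan
rng = np.random.default_rng(42)
X = pd.DataFrame({"Income": rng.normal(75, 40, size=1000)})
y = pd.Series((rng.random(1000) < 0.09).astype(int), name="Personal_Loan")

# stratify=y keeps the class ratio (nearly) identical in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())  # positive rates should be nearly equal
```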
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def plot_confusion_matrix(model, predictors, target, title="Confusion Matrix"):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
# Predict the target values using the provided model and predictors
y_pred = model.predict(predictors)
# Compute the confusion matrix comparing the true target values with the predicted values
cm = confusion_matrix(target, y_pred)
# Create labels for each cell in the confusion matrix with both count and percentage
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2) # reshaping to a matrix
# Set the figure size for the plot
plt.figure(figsize=(6, 4))
# Plot the confusion matrix as a heatmap with the labels
sns.heatmap(cm, annot=labels, fmt="")
# Add a title to the plot
plt.title(title)
# Add a label to the y-axis
plt.ylabel("True label")
# Add a label to the x-axis
plt.xlabel("Predicted label")
# creating an instance of the decision tree model
slimDtree1 = DecisionTreeClassifier(random_state=RS)
capDtree1 = DecisionTreeClassifier(random_state=RS)
# fitting the model to the training data
slimDtree1.fit(slimX_train, slimY_train)
capDtree1.fit(capX_train, capY_train)
# Evaluate slimData training set
perf_slim = model_performance_classification(slimDtree1, slimX_train, slimY_train)
print("slimData Training Performance:\n", perf_slim)
plot_confusion_matrix(slimDtree1, slimX_train, slimY_train, "Slim Training Confusion Matrix")
# Evaluate slimCapData training set
perf_cap = model_performance_classification(capDtree1, capX_train, capY_train)
print("slimCapData Training Performance:\n", perf_cap)
plot_confusion_matrix(capDtree1, capX_train, capY_train, "Cap Training Confusion Matrix")
# Evaluate testing data sets
slimDtree1_test_perf = model_performance_classification(slimDtree1, slimX_test, slimY_test)
capDtree1_test_perf = model_performance_classification(capDtree1, capX_test, capY_test)
print("slimData Testing Performance:\n", slimDtree1_test_perf)
print("slimCapData Testing Performance:\n", capDtree1_test_perf)
plot_confusion_matrix(slimDtree1, slimX_test, slimY_test, "Slim Testing Confusion Matrix")
plot_confusion_matrix(capDtree1, capX_test, capY_test, "Cap Testing Confusion Matrix")
# list of feature names in X_train
# initial_feature_names = list(slimX_train.columns)
# set the figure size for the plot
plt.figure(figsize=(20, 20))
# plotting the decision tree
out = tree.plot_tree(
slimDtree1, # decision tree classifier model
feature_names=slimX_train.columns, # list of feature names (columns) in the dataset
filled=True, # fill the nodes with colors based on class
fontsize=9, # font size for the node text
node_ids=False, # do not show the ID of each node
class_names=None, # whether or not to display class names
)
# add arrows to the decision tree splits if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black") # set arrow color to black
arrow.set_linewidth(1) # set arrow linewidth to 1
# displaying the plot
plt.show()
# define the parameters of the tree to iterate over
#max_depth_values = np.arange(2, 11, 2)
#max_leaf_nodes_values = np.arange(10, 51, 10)
#min_samples_split_values = np.arange(10, 51, 10)
max_depth_values = np.arange(2, 7, 1)
max_leaf_nodes_values = np.arange(5, 21, 5)
min_samples_split_values = np.arange(5, 26, 5)
# initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
# iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
for max_leaf_nodes in max_leaf_nodes_values:
for min_samples_split in min_samples_split_values:
# initialize the tree with the current set of parameters
estimator = DecisionTreeClassifier(
max_depth=max_depth,
max_leaf_nodes=max_leaf_nodes,
min_samples_split=min_samples_split,
random_state=RS
)
# fit the model to the training data
estimator.fit(slimX_train, slimY_train)
# make predictions on the training and test sets
y_train_pred = estimator.predict(slimX_train)
y_test_pred = estimator.predict(slimX_test)
# calculate F1 scores for training and test sets
train_f1_score = f1_score(slimY_train, y_train_pred)
test_f1_score = f1_score(slimY_test, y_test_pred)
# calculate the absolute difference between training and test F1 scores
score_diff = abs(train_f1_score - test_f1_score)
# update the best estimator and best score if the current one has a smaller score difference
if score_diff < best_score_diff:
best_score_diff = score_diff
best_estimator = estimator
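The manual triple loop above selects the parameter set with the smallest train/test F1 gap. As an alternative sketch, the already-imported GridSearchCV can search the same ranges using cross-validated F1, which avoids selecting against the test set; synthetic data from `make_classification` stands in for slimX_train/slimY_train here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real training set
X, y = make_classification(n_samples=500, weights=[0.9], random_state=42)

# Same parameter ranges as the manual loop above
param_grid = {
    "max_depth": range(2, 7),
    "max_leaf_nodes": range(5, 21, 5),
    "min_samples_split": range(5, 26, 5),
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1",  # select on cross-validated F1 rather than train/test gap
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The best estimator is then available as `grid.best_estimator_`, already refit on the full training data.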
# creating an instance of the best model
dtree2 = best_estimator
# fitting the best model to the training data
dtree2.fit(slimX_train, slimY_train)
# Evaluate slimData training set
perf_dtree2 = model_performance_classification(dtree2, slimX_train, slimY_train)
print("dtree2 Training Performance:\n", perf_dtree2)
plot_confusion_matrix(dtree2, slimX_train, slimY_train, "dtree2 Training Confusion Matrix")
# Evaluate testing data sets
dtree2_test_perf = model_performance_classification(dtree2, slimX_test, slimY_test)
print("dtree2 Testing Performance:\n", dtree2_test_perf)
plot_confusion_matrix(dtree2, slimX_test, slimY_test, "dtree2 Testing Confusion Matrix")
# list of feature names in X_train
# pre_feature_names = list(slimX_train.columns)
# set the figure size for the plot
plt.figure(figsize=(12, 12))
# plotting the decision tree
out = tree.plot_tree(
dtree2, # decision tree classifier model
feature_names=slimX_train.columns, # list of feature names (columns) in the dataset
filled=True, # fill the nodes with colors based on class
fontsize=9, # font size for the node text
node_ids=False, # do not show the ID of each node
class_names=None, # whether or not to display class names
)
# add arrows to the decision tree splits if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black") # set arrow color to black
arrow.set_linewidth(1) # set arrow linewidth to 1
# displaying the plot
plt.show()
# printing a text report showing the rules of a decision tree
print(
tree.export_text(
dtree2, # specify the model
feature_names=slimX_train.columns, # specify the feature names
show_weights=True # specify whether or not to show the weights associated with the model
)
)
# Create an instance of the decision tree model
clf = DecisionTreeClassifier(random_state=RS)
# Compute the cost complexity pruning path for the model using the training data
path = clf.cost_complexity_pruning_path(slimX_train, slimY_train)
# Extract the array of effective alphas from the pruning path
ccp_alphas = abs(path.ccp_alphas)
# Extract the array of total impurities at each alpha along the pruning path
impurities = path.impurities
pd.DataFrame(path).head()
# Create a figure
fig, ax = plt.subplots(figsize=(10, 5))
# Plot the total impurities versus effective alphas, excluding the last value,
# using markers at each data point and connecting them with steps
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
# Set the x-axis label
ax.set_xlabel("Effective Alpha")
# Set the y-axis label
ax.set_ylabel("Total impurity of leaves")
# Set the title of the plot
ax.set_title("Total Impurity vs Effective Alpha for training set");
The last value in ccp_alphas is the alpha that prunes the whole tree, leaving the corresponding tree with a single node.
# Initialize an empty list to store the decision tree classifiers
clfs = []
# Iterate over each ccp_alpha value extracted from cost complexity pruning path
for ccp_alpha in ccp_alphas:
# Create an instance of the DecisionTreeClassifier
clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=RS)
# Fit the classifier to the training data
clf.fit(slimX_train, slimY_train)
# Append the trained classifier to the list
clfs.append(clf)
# Print the number of nodes in the last tree along with its ccp_alpha value
print(
"Number of nodes in the last tree is {} with ccp_alpha {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
# Remove the last classifier and corresponding ccp_alpha value from the lists
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
# Extract the number of nodes in each tree classifier
node_counts = [clf.tree_.node_count for clf in clfs]
# Extract the maximum depth of each tree classifier
depth = [clf.tree_.max_depth for clf in clfs]
# Create a figure and a set of subplots
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
# Plot the number of nodes versus ccp_alphas on the first subplot
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("Alpha")
ax[0].set_ylabel("Number of nodes")
ax[0].set_title("Number of nodes vs Alpha")
# Plot the depth of tree versus ccp_alphas on the second subplot
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("Alpha")
ax[1].set_ylabel("Depth of tree")
ax[1].set_title("Depth vs Alpha")
# Adjust the layout of the subplots to avoid overlap
fig.tight_layout()
train_f1_scores = [] # Initialize an empty list to store F1 scores for training set for each decision tree classifier
# Iterate through each decision tree classifier in 'clfs'
for clf in clfs:
# Predict labels for the training set using the current decision tree classifier
pred_train = clf.predict(slimX_train)
# Calculate the F1 score for the training set predictions compared to true labels
f1_train = f1_score(slimY_train, pred_train)
# Append the calculated F1 score to the train_f1_scores list
train_f1_scores.append(f1_train)
test_f1_scores = [] # Initialize an empty list to store F1 scores for test set for each decision tree classifier
# Iterate through each decision tree classifier in 'clfs'
for clf in clfs:
# Predict labels for the test set using the current decision tree classifier
pred_test = clf.predict(slimX_test)
# Calculate the F1 score for the test set predictions compared to true labels
f1_test = f1_score(slimY_test, pred_test)
# Append the calculated F1 score to the test_f1_scores list
test_f1_scores.append(f1_test)
# Create a figure
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("Alpha") # Set the label for the x-axis
ax.set_ylabel("F1 Score") # Set the label for the y-axis
ax.set_title("F1 Score vs Alpha for training and test sets") # Set the title of the plot
# Plot the training F1 scores against alpha, using circles as markers and steps-post style
ax.plot(ccp_alphas, train_f1_scores, marker="o", label="training", drawstyle="steps-post")
# Plot the testing F1 scores against alpha, using circles as markers and steps-post style
ax.plot(ccp_alphas, test_f1_scores, marker="o", label="test", drawstyle="steps-post")
ax.legend(); # Add a legend to the plot
# identifying the model with the highest test F1 score
index_best_model = np.argmax(test_f1_scores)
# selecting the decision tree model corresponding to the highest test score
postPruneDTree = clfs[index_best_model]
print(postPruneDTree)
# list of feature names in X_train
# post_feature_names = list(slimCapData.columns)
# set the figure size for the plot
plt.figure(figsize=(13, 14))
# plotting the decision tree
out = tree.plot_tree(
    postPruneDTree,  # decision tree classifier model
    feature_names=slimX_train.columns,  # list of feature names (columns) in the dataset
    filled=True,  # fill the nodes with colors based on class
    fontsize=9,  # font size for the node text
    node_ids=False,  # do not show the ID of each node
    class_names=None,  # class names are not specified; default labels are used
)
# add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")  # set arrow color to black
        arrow.set_linewidth(1)  # set arrow linewidth to 1
# displaying the plot
plt.show()
# printing a text report showing the rules of a decision tree
print(
    tree.export_text(
        postPruneDTree,  # specify the model
        feature_names=slimX_train.columns,  # specify the feature names
        show_weights=True,  # specify whether or not to show the weights associated with the model
    )
)
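For reference, `export_text` renders each split as nested `|---` branches, and `show_weights=True` appends the per-class sample weights at each leaf. A minimal sketch on a toy tree (the iris data here is illustrative, not the notebook's features):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each line of the report is one branch of the tree; leaves show the class weights
rules = export_text(clf, feature_names=iris.feature_names, show_weights=True)
print(rules)
```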
print("Post Pruning Training Performance")
plot_confusion_matrix(postPruneDTree, slimX_train, slimY_train)
postPruneDTree_train_perf = model_performance_classification(
postPruneDTree, slimX_train, slimY_train
)
postPruneDTree_train_perf
print("Post Pruning Testing Performance")
plot_confusion_matrix(postPruneDTree, slimX_test, slimY_test)
postPruneDTree_test_perf = model_performance_classification(
postPruneDTree, slimX_test, slimY_test
)
postPruneDTree_test_perf
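The `model_performance_classification` helper is defined elsewhere in the notebook; a plausible reconstruction, assuming it returns a one-row DataFrame of standard classification metrics (the exact metric set and column names are assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

def model_performance_classification(model, X, y):
    """One-row DataFrame of common classification metrics.
    (Hypothetical reconstruction of the notebook's helper.)"""
    pred = model.predict(X)
    return pd.DataFrame({
        "Accuracy": [accuracy_score(y, pred)],
        "Recall": [recall_score(y, pred)],
        "Precision": [precision_score(y, pred)],
        "F1": [f1_score(y, pred)],
    })

# Demo on synthetic data: an unconstrained tree fits its training set perfectly
X, y = make_classification(n_samples=200, random_state=1)
demo_model = DecisionTreeClassifier(random_state=1).fit(X, y)
perf = model_performance_classification(demo_model, X, y)
```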
Post-Pruning Conclusion
True Positives (Loan Takers Correctly Identified):
False Negatives (Missed Loan Opportunities):
False Positives (Wasted Marketing Spend):
True Negatives
Practical Business Implications:
Using the scaled datasets established in preprocessing.
We will begin the analysis and processing with both datasets; if there is no noticeable difference in clustering, we will stick with scaled_slim_data.
# Initiating the t-SNE object
# n_components=2: we will reduce the data to 2 dimensions
# n_jobs=-2 specifies to use all but one processor core for parallel computation, which speeds up the process
tsne2 = TSNE(n_components=2, n_jobs=-2, random_state=RS)
# Performing dimensionality reduction on the scaled data
# fit the t-SNE model to the two datasets and transform them into the specified number of dimensions
tsne2_reduced_slim_data = tsne2.fit_transform(scaled_slim_data)
tsne2_reduced_cap_data = tsne2.fit_transform(scaled_cap_data)
# Create DataFrames from the reduced datasets
tsne_2d_slim_data = pd.DataFrame(tsne2_reduced_slim_data, columns=(["Feature 1","Feature 2"])) # This DataFrame will have two columns corresponding to the two reduced dimensions
tsne_2d_cap_data = pd.DataFrame(tsne2_reduced_cap_data, columns=(["Feature 1","Feature 2"])) # This DataFrame will have two columns corresponding to the two reduced dimensions
Create and compare Scatterplots for both new t-SNE datasets
# defining the figure size
plt.figure(figsize=(12, 10))
# Scatterplot of the slim dataset
plt.subplot(2, 2, 1)  # assign subplot in the main plot
sns.scatterplot(data=tsne_2d_slim_data, x="Feature 1", y="Feature 2");
plt.title('tsne_2d_slim_data')  # add title
# Scatterplot of the cap dataset
plt.subplot(2, 2, 2)  # assign subplot in the main plot
sns.scatterplot(data=tsne_2d_cap_data, x="Feature 1", y="Feature 2");
plt.title('tsne_2d_cap_data')  # add title
plt.tight_layout();  # to add spacing between plots
plt.show()
# Define the list of perplexity values to iterate over
#
# NOTE: as mentioned below, this code block took over 5 minutes to execute.
# In the interest of saving time for both myself and anyone who reviews this,
# I commented out the actual array of perplexity values that were used in my analysis.
#
# perplexities = [5, 10, 20, 40, 50, 75, 100, 150]
#
# I left an active variable with only the perplexity=100 value that was eventually chosen.
# The image below is the output of the original analysis with the 8 perplexity values that we examined.
# Please do NOT double-click it in a .ipynb file, as it will open an enormous encoded file.
perplexities = [100]
# plt.figure(figsize=(20, 15))
plt.figure(figsize=(15, 10))
# Iterate over each perplexity value
for i in range(len(perplexities)):
    # Initiate TSNE with the current perplexity value
    # n_jobs specifies the number of cores to use for parallel computation; -2 means use all but 1 core
    tsne = TSNE(n_components=2, perplexity=perplexities[i], n_jobs=-2, random_state=RS)
    # fit_transform() fits the TSNE model to the data and transforms it into the specified number of dimensions
    X_cap_red = tsne.fit_transform(scaled_cap_data)
    # creating a new dataframe with reduced dimensions
    red_cap_data_df = pd.DataFrame(X_cap_red, columns=["Feature 1", "Feature 2"])
    # Assign the subplot in the 2x4 grid
    plt.subplot(2, 4, i + 1)
    plt.title("perplexity=" + str(perplexities[i]))  # setting plot title
    sns.scatterplot(data=red_cap_data_df, x="Feature 1", y="Feature 2")
plt.tight_layout(pad=2)
plt.show()
---> NOTE: DO NOT DOUBLE-CLICK ON THE IMAGE BELOW <---
---> NOTE: DO NOT DOUBLE-CLICK ON THE IMAGE ABOVE <---
# Initiate the TSNE object and set output dimension to 3
# n_jobs=-2 specifies to use all but one core for parallel computation, which speeds up the process
tsne3 = TSNE(n_components=3, perplexity=100, n_jobs=-2, random_state=RS)
# Performing dimensionality reduction on the scaled data
# fit_transform() fits the TSNE model to the data and transforms it into the specified number of dimensions
tsne3_reduced_data = tsne3.fit_transform(scaled_cap_data)
# Creating a DataFrame from the reduced data
tsne_3d_data = pd.DataFrame(tsne3_reduced_data, columns=(["Feature 1","Feature 2","Feature 3"])) # This DataFrame will have three columns corresponding to the three reduced dimensions
# plotting the 3D scatterplot
fig = px.scatter_3d(tsne_3d_data, x='Feature 1', y='Feature 2', z='Feature 3', size_max=1, opacity=0.1)
fig.show()
# define the range of cluster counts (K values) to evaluate
n_clusters_range = range(4, 8, 1)
# Create a dictionary to store K-means objects
kmeans_objects = {}
# create a figure for hosting subplots
plt.figure(figsize=(10, 10))
# create index for subplot locations
subLoc = 1
# Loop through the range, create K-means objects,
# calculate the WCSS (Within-Cluster Sum of Squares) and Silhouette scores
# for each K value, and print the information/scatterplot
for i in n_clusters_range:
    kmeans_name = f"Kmeans_{i}"
    kmeans_objects[kmeans_name] = KMeans(n_clusters=i, random_state=RS)
    kmeans_objects[kmeans_name].fit(scaled_cap_data)
    kmeans_label_att = kmeans_objects[kmeans_name].labels_
    sil_score = silhouette_score(scaled_cap_data, kmeans_label_att)
    print(f"{kmeans_name}: {kmeans_objects[kmeans_name]}")
    print(f"WCSS for {kmeans_name}: {kmeans_objects[kmeans_name].inertia_}")
    print(f"Silhouette score for K={i} is {sil_score}")
    print("\n")
    # Assign the subplot in the 2x2 grid
    plt.subplot(2, 2, subLoc)
    # Assigning the current cluster labels to the tsne_2d_cap_data DataFrame
    tsne_2d_cap_data['Clusters'] = kmeans_objects[kmeans_name].labels_
    plt.title(kmeans_name)  # set plot title
    sns.scatterplot(data=tsne_2d_cap_data, x='Feature 1', y='Feature 2', hue='Clusters', palette="viridis");
    plt.tight_layout(pad=2)
    subLoc = subLoc + 1
WCSS Elbow Method
# calculate WCSS for a range of K values
wcss_list = []
# Iterate over a range of K values from 2 to 10
for i in range(2, 11):
    # Create a KMeans clusterer object with the current K value
    clusterer = KMeans(n_clusters=i, random_state=RS)
    # Fit the clusterer to the scaled data
    clusterer.fit(scaled_cap_data)
    # Append the inertia (WCSS) to the wcss_list
    wcss_list.append(clusterer.inertia_)
# Plot the WCSS values against the number of clusters
plt.plot(range(2, 11), wcss_list, marker='o')
plt.title('The Elbow Method') # Set the title of the plot
plt.xlabel('Number of clusters') # Label the x-axis
plt.ylabel('WCSS') # Label the y-axis
plt.xticks(range(2, 11)) # Set the x-ticks from 2 to 10 to match the evaluated K values
plt.grid(True) # Enable grid lines on the plot
plt.show() # Display the plot
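As a sanity check on what the elbow curve measures: scikit-learn's `inertia_` attribute is exactly the WCSS, i.e. the sum of squared distances of each sample to its nearest cluster centroid. A small sketch on toy data (the two-blob layout is illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two obvious clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Recompute the WCSS by hand: squared distance of each point to its assigned centroid
manual_wcss = sum(
    np.sum((x - km.cluster_centers_[label]) ** 2)
    for x, label in zip(X, km.labels_)
)
```

`manual_wcss` matches `km.inertia_` up to floating-point error, which is why appending `clusterer.inertia_` above gives the elbow curve directly.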
Silhouette Method
# calculate Silhouette Scores for a range of K values
sil_score = []
# Iterate over a range of K values from 2 to 10
for i in range(2, 11):
    # Create a KMeans clusterer object with the current K value
    clusterer = KMeans(n_clusters=i, random_state=RS)
    # Fit the clusterer to the scaled data
    clusterer.fit(scaled_cap_data)
    # Calculate the Silhouette Score
    score = silhouette_score(scaled_cap_data, clusterer.labels_)
    # Append the Silhouette Score to the sil_score list
    sil_score.append(score)
# Plot the Silhouette Scores against the number of clusters
plt.plot(range(2, 11), sil_score, marker='o')
plt.title('The Silhouette Method') # Set the title of the plot
plt.xlabel('Number of clusters') # Label the x-axis
plt.ylabel('Silhouette Score') # Label the y-axis
plt.xticks(range(2, 11)) # Set the x-ticks from 2 to 10
plt.grid(True) # Enable grid lines on the plot
plt.show() # Display the plot
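For intuition on the metric plotted above: `silhouette_score` is the mean of the per-sample silhouette coefficients (each in [-1, 1], where values near 1 indicate a point sits much closer to its own cluster than to the next-nearest one). A minimal sketch on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Two well-separated toy clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The overall score is the mean of the per-sample coefficients
score = silhouette_score(X, labels)
per_sample = silhouette_samples(X, labels)
```

With clusters this well separated, the score is close to 1; the flatter, lower scores seen on the real scaled data reflect heavily overlapping clusters.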
tsne_3d_data['Clusters'] = kmeans_objects['Kmeans_5'].labels_
fig = px.scatter_3d(tsne_3d_data, x='Feature 1', y='Feature 2', z='Feature 3', color='Clusters', opacity=.1)
fig.show()
# Add the cluster labels back to the DataFrame
slimCapData['Clusters'] = kmeans_objects['Kmeans_5'].labels_
# Display the slimCapData DataFrame with original values
slimCapData.head()
# checking the distribution of the categories in Clusters
print(100*slimCapData['Clusters'].value_counts(normalize=True), '\n')
# plotting the count plot for clusters
sns.countplot(slimCapData, x='Clusters', palette='viridis').set_title("Distribution Of The Clusters");
# Prepare for plotting boxplots of numerical variables for each cluster
plt.figure(figsize=(12, 10)) # Set the figure size for the plot
plt.suptitle("Boxplot of variables for each cluster") # Set the main title for the plot
# Iterate over each numerical variable in the dataframe
for i, variable in enumerate(slimCapData.columns.to_list()[:-1]):
    plt.subplot(4, 3, i + 1)
    sns.boxplot(data=slimCapData, x="Clusters", y=variable, palette='viridis')  # Create a boxplot for the current variable by cluster
# Adjust layout of subplots to improve spacing
plt.tight_layout(pad=2.0)
# Prepare for plotting barplots of numerical variables for each cluster
plt.figure(figsize=(12, 10)) # Set the figure size for the plot
plt.suptitle("Barplots of all variables for each cluster") # Set the main title for the plot
for i, variable in enumerate(slimCapData.columns.to_list()[:-1]):
    plt.subplot(4, 3, i + 1)
    sns.barplot(data=slimCapData, x="Clusters", y=variable, palette='viridis', errorbar=None)
plt.tight_layout(pad=2.0)
Visual Analysis of Scatterplots
Elbow Method Analysis
Silhouette Method
Recommendation: K=5
Practical Business Implications The cluster profiling revealed that there is little distinction between loan-taking customers and non-loan customers across the demographic and behavioral variables available in the dataset.
Final Model Selection: Post-pruning Decision Tree
Cluster model: The post-K-means profiling analysis showed that the clusters provided little distinction between loan-taking and non-loan customers across the demographic and behavioral variables available in the dataset, which eliminated this model.
Pre-pruning vs Post-pruning Decision Trees:
The post-pruning model proved superior on all measures.
Feature Importance
Primary Target Segments (296 customers, 100% Acceptance Rate in model)
Secondary Target Segments High-Probability Prospects (70-100% Acceptance Rate)
Execution Strategy
Phase 1: Tier 1 Precision Targeting (Weeks 1-2)
Phase 2: Tier 2 Expanded Outreach (Weeks 3-4)
Phase 3: Ongoing Refinement and Optimization
Key Marketing Insights
Bottom Line